Abstract

A collective is a set of self-interested agents which try to maximize their own utilities, 
along with a well-defined, time-extended world utility function which rates the performance
of the entire system. In this paper, we use the theory of collectives to design time-extended payoff
utilities for agents that are both “aligned” with the world utility, and are “learnable”, i.e., 
the agents can readily see how their behavior affects their utility. We show that in systems 
where each agent aims to optimize such payoff functions, coordination arises as a byproduct 
of the agents selfishly pursuing their own goals. A game theoretic analysis shows that such 
payoff functions have the net effect of aligning the Nash equilibrium, Pareto optimal solution 
and world utility optimum, thus eliminating undesirable behavior such as agents working at 
cross-purposes.

We then apply collective-based payoff functions to the token collection in a gridworld 
problem where agents need to optimize the aggregate value of tokens collected across an 
episode of finite duration (i.e., an abstracted version of rovers on Mars collecting scientif- 
ically “interesting” rock samples, subject to power limitations). We show that, regardless 
of the initial token distribution, reinforcement learning agents using collective-based payoff 
functions significantly outperform both “natural” extensions of single agent algorithms and 
global reinforcement learning solutions based on “team games”.


1 Introduction 

There are many problems which can only be properly addressed by having a set of autonomous 
agents act independently and have their joint sequence of actions optimize a pre-set utility func- 
tion. Such systems, called collectives, can be formally defined by the following characteristics: 
there is a collection of self-interested agents which try to maximize their own payoff utility 
functions; and there is a well-defined, time-extended world utility function, which rates the 
performance of the entire system [32, 33]. Furthermore, in general, collectives lack centralized 
communication or control. 
Examples of problems which require a collective-based approach include control of a constellation
of satellites, construction of distributed algorithms, routing over a data network, and control
of a collection of planetary exploration vehicles (e.g., rovers on Mars, or submersibles under Eu- 
ropa’s ice caps). For the collective-based approach to work in such problems, two fundamental 
issues need to be addressed: 

1. the agents need to learn the action sequence that will provide good values for their payoff 
utility functions, i.e., agents need to achieve their own goals; and 

2. the agents’ learning their own payoff utilities needs to benefit the world utility, i.e., the 
agents’ utilities need to be “aligned” with the world utility so that the agents do not work 
at cross-purposes, as far as the world utility is concerned. 

The first of these issues has been dealt with extensively in the single agent context; there 
are many reinforcement learning systems [20] (e.g., Q-learners [24]) that have successfully been
applied to real world problems [2, 1]. The second problem has received less attention, and 
generally the solution consists of either each agent receiving the world utility as their payoff 
utility (e.g., “team” games [3]), or of imposing external mechanisms (e.g., contracts, auctions) 
that encourage the agents to work together [8, 9, 16, 17]. 

Addressing these two issues simultaneously, without extensive hand-tailoring, is the main
challenge in designing collectives¹. In particular, the design problem can be formulated as follows:
Assuming the individual agents are able to maximize their own utility functions (e.g., through 
reinforcement learning), what set of payoff utilities for the individual agents will, when pursued 
by those agents, result in high world utility? That is, to induce good collective behavior, we 
leverage the assumption that our learners are individually fairly good at what they do, and focus 
on determining what it is that they should do. 

The theory of collectives provides the framework to address such design problems [22, 32]. The 
crux of that theory concerns the derivation of payoff functions that are both “aligned” with the 
world utility and are “learnable” (discussed in detail in Section 2). Factoredness (a formalization 
of the concept of “alignedness”) ensures that an action taken by an agent that improves its payoff 
utility also improves the world utility. Learnability ensures that an agent can discern the effect 
of its own actions on its own utility and select actions that improve that utility. In Section 2 we 
summarize how, given a world utility, one can design payoff utilities that maximize learnability
while restricted to the set of functions that are factored.

As a naturally occurring example, consider the human economy as a collective. The agents 
are identified with the individuals trying to maximize their payoff utilities (e.g., maximize bank 
account, advance career). An example of a world utility is the time average of the gross domestic 
product (“world utility” is not a construction internal to a human economy, but rather something 
defined from the outside). To achieve high world utility it is imperative that the agents do not 
work at cross-purposes; otherwise frustrational phenomena like the tragedy of the commons, in 
which individual avarice works to lower world utility [7], can occur. Modifications to the agents’ 
utility functions via punitive legislation is one way in which such undesirable phenomena can be 

1 Though hand-tailoring the agents’ payoff utilities may work in some domains, such systems (i) have to be 
laboriously modeled; (ii) provide “brittle” global performance; (iii) are not “adaptive” to changing environments; 
and (iv) generally do not scale well. 
avoided². Securities and Exchange Commission (SEC) regulations designed to prevent insider
trading can be viewed as a real world example of an attempt to make such a modification to the 
agents’ utilities. For example, a trade that once may have added to your wealth while hurting 
the economy, may now lead to your prosecution. You are therefore unlikely to make such a trade. 
As a consequence of this new regulation, your payoff utility and the world utility have become 
more aligned. 

The problem of designing a collective is related to work in many fields beyond multiagent 
systems, and includes mechanism design, reinforcement learning for adaptive control, computa- 
tional ecologies, and game theory. However none of these fields directly addresses the inverse 
problem of how to design the agents’ utilities to reach a desirable world utility value in its full 
generality. This is even true for the field of mechanism design, which while addressing an inverse 
problem similar to that of the design of collectives, does so only for certain restricted domains,
and does not address the “learnability” issue. For example, mechanism design is mostly ap- 
propriate when there are pre-specified goals underlying agents’ utilities over which “incentives” 
need to be provided, and when Pareto-optimality (rather than optimization of a world utility) is 
often the goal [31]. We discuss the connection between the Groves mechanism, the AGV-Arrow
mechanism and collective-based design in Section 2. 

To date, the theory of collectives has been successfully applied to various domains, including 
packet routing over a data network [33], the congestion game known as Arthur’s El Farol Bar 
problem [34], and data downloads from a constellation of satellites [30]. In particular, in the 
routing domain, the collective-based approach achieved performance improvements of a factor of 
three over the conventional Shortest Path Algorithm (SPA) routing algorithms currently running 
on the internet [29], and avoided the Braess’ routing paradox which plagues the SPA-based 
systems [22]. 

In the work described above, agents were concerned with optimizing “rewards” (i.e., utility 
value of a single time step). In this paper we extend these results to a problem where agents need 
to optimize a time-extended utility function through selecting sequences of actions. We show that 
in this significantly more complex domain, agents that use collective-based utilities provide
solutions that are significantly superior to those of agents that use either team games or “natural” utilities.

In Section 2, we provide a summary of the theory of collectives and discuss its relationship to 
mechanism design. In Section 3, we describe the gridworld problem domain and develop a col- 
lective based solution to the design of agents’ payoff utilities. In Section 4, we present simulation 
results that show that collective-based solutions significantly outperform both traditional and
more “natural” solutions. Finally, in Section 5, by studying the Nash equilibrium of a simple 
system, we demonstrate how collective-based algorithms achieve performance unattainable by 
systems using “selfish” payoff functions. 

2 In this context we say we “modify” the agents’ initial utility ($f_1$) when we add incentives ($f_2$). From an
economics perspective where an agent’s utility is considered immutable, one would refer to this operation as
adding incentives to the agent’s utility. Whether one calls the function $g = f_1 + f_2$ the agents’ new utility, or
whether the agents’ utility is still $f_1$ but is now subject to incentives $f_2$, is purely a matter of semantics.
2 Background: Theory of Collectives


In this section, we summarize the portion of the theory of collectives necessary and sufficient to
describe the learning of sequences of actions in a distributed system [27]. Let $Z$ be an arbitrary
vector space whose elements $\zeta$ give the joint move of all agents in the system (i.e., $\zeta$ specifies the
full “worldline” consisting of the actions/states of all the agents). The provided world utility
$G(\zeta)$ is a function of the full worldline, and we wish to search for the $\zeta$ that maximizes $G(\zeta)$.

In addition to $G$, there is a set of payoff utility functions $\{g_\eta\}$, one for each agent $\eta$. The
agents will act to improve their individual payoff functions, even though we, as system designers,
are only concerned with the value of the world utility $G$. To specify all agents other than $\eta$, we
will use the notation $\hat{\eta}$.

2.1 Intelligence 

We need to have a way to “standardize” utility functions so that the numeric value they assign
to a $\zeta$ only reflects the ranking of $\zeta$ relative to certain other elements of $Z$. We call such a
standardization of some arbitrary utility $U$ for agent $\eta$ the “intelligence for $\eta$ at $\zeta$ with respect
to $U$”. Here we use intelligences that are equivalent to percentiles:

$$\epsilon_U(\zeta : \eta) = \int d\mu_{\zeta_{\hat{\eta}}}(\zeta') \, \Theta[U(\zeta) - U(\zeta')] , \qquad (1)$$

where the Heaviside function $\Theta$ is defined to equal 1 when its argument is greater than or equal
to 0, and to equal 0 otherwise, and where the subscript on the (normalized) measure $d\mu$ indicates
it is restricted to $\zeta'$ sharing the same non-$\eta$ components as $\zeta$.³ Note that intelligence values are
always between 0 and 1. Intuitively, intelligence values indicate what percentage of $\eta$’s actions
would have resulted in lower utility. Accordingly, $\epsilon_{g_\eta}(\zeta : \eta) = 1$ means that agent $\eta$ is fully
rational at $\zeta$, in that its move maximizes its payoff, given the moves of other agents. Figure 1
shows an example where 60% of $\eta$’s actions would have resulted in worse utility, giving $\eta$ an
intelligence of 0.6 at that point ($\zeta$).
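As a rough illustration of Equation 1, the following Python sketch computes a percentile intelligence for one agent over a finite set of alternative moves. It assumes a uniform measure over those moves; the function name `intelligence` and the utility values are hypothetical and chosen only to mirror the 0.6 case described for Figure 1.

```python
def intelligence(utilities, chosen):
    """Fraction of the agent's alternative moves whose utility is no greater
    than that of the chosen move (the other agents' moves are held fixed).

    Assumption: a uniform measure over a finite set of alternative moves.
    """
    u_actual = utilities[chosen]
    # Heaviside term: contributes 1 when U(actual) - U(alternative) >= 0.
    hits = sum(1 for u_alt in utilities if u_actual - u_alt >= 0)
    return hits / len(utilities)


# Hypothetical utilities of ten alternative moves for one agent.
example_utilities = [0.1, 0.2, 0.3, 0.4, 0.5, 0.55, 0.9, 0.95, 1.0, 1.2]
print(intelligence(example_utilities, chosen=5))  # six of ten moves are no better
print(intelligence(example_utilities, chosen=9))  # the best move: fully rational
```

With these values the sixth move has intelligence 0.6 and the best move has intelligence 1, the fully rational case.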

Our uncertainty concerning the behavior of the system is reflected in a probability distribution
over $Z$. Our ability to control the system consists of setting the value of some characteristic of
the collection of agents, e.g., setting the payoff functions of the agents. Indicating that value by
$s$, our analysis revolves around the following central equation for $P(G \mid s)$, which follows from
Bayes’ theorem:

$$P(G \mid s) = \int d\epsilon_G \, P(G \mid \epsilon_G, s) \int d\epsilon_g \, P(\epsilon_G \mid \epsilon_g, s) \, P(\epsilon_g \mid s) , \qquad (2)$$

where $\epsilon_g = (\epsilon_{g_{\eta_1}}(\zeta : \eta_1), \epsilon_{g_{\eta_2}}(\zeta : \eta_2), \cdots)$ is the vector of the intelligences of the agents with
respect to their associated payoff functions, and $\epsilon_G = (\epsilon_G(\zeta : \eta_1), \epsilon_G(\zeta : \eta_2), \cdots)$ is the vector of
the intelligences of the agents with respect to $G$.

3 The measure must reflect the type of system at hand, e.g., whether Z is countable or not, and if not, what 
coordinate system is being used. Other than that, any convenient choice of measure may be used and the theorems 
will still hold. 
Figure 1: Intelligence of agent $\eta$ at state $\zeta$ for utility $U$: $\zeta$ is the actual joint move at hand.
The x-axis shows agent $\eta$’s alternative possible moves (all states $\zeta'$ having $\zeta$’s values
for the moves of all agents other than $\eta$). The bold lines show the alternative moves
that $\eta$ could have made that would have given $\eta$ a worse value of the utility $U$. The
fraction of those bold lines to the full set of $\eta$’s possible moves (which is 0.6 in this
example) is the intelligence of agent $\eta$ at $\zeta$ for utility $U$, denoted by $\epsilon_U(\zeta : \eta)$.

Note that, from a game-theoretic perspective, a point $\zeta$ where all players are rational
($\epsilon_{g_\eta}(\zeta : \eta) = 1$ for all agents $\eta$) is a game theory Nash equilibrium [4]. On the other hand, a $\zeta$ at which
all components of $\epsilon_G$ equal 1 is a local maximum of $G$ (or more precisely, a critical point of the $G(\zeta)$
surface).

If we can choose $s$ so that the third conditional probability in the integrand, $P(\epsilon_g \mid s)$, is
peaked around vectors $\epsilon_g$ all of whose components are close to 1 (that is, agents are able to
“learn” their tasks), then we have likely induced large payoff utility intelligences. If we can also
have the second term, $P(\epsilon_G \mid \epsilon_g, s)$, be peaked about $\epsilon_G$ equal to $\epsilon_g$ (that is, the payoff and
world utilities are aligned), then $\epsilon_G$ will also be large. Finally, if the first term in the integrand,
$P(G \mid \epsilon_G, s)$, is peaked about high $G$ when $\epsilon_G$ is large, then our choice of $s$ will likely result in
high $G$, as desired.

2.2 Factoredness and Learnability 

The requirement that payoff functions have high “signal-to-noise” (an issue not considered in
conventional work in mechanism design) arises in the third term. It is in the second term that
the requirement that the payoff functions be “aligned” with $G$ arises. In this work we concentrate
on these two terms, and show how to simultaneously set them to have the desired form.⁴

Details of the stochastic environment in which the collection of agents operate, together with
details of the learning algorithms of the agents, are reflected in the distribution $P(\zeta)$ which
underlies the distributions appearing in Equation 2. Note though that independent of these
considerations, our desired form for the second term in Equation 2 is assured if we have chosen
payoff utilities such that $\epsilon_g$ equals $\epsilon_G$ exactly for all $\zeta$. We call such a system factored. In game

4 Non-game theory-based function maximization techniques like simulated annealing instead address how to 
have term 1 have the desired form. They do this by trying to ensure that the local maxima that the underlying 
system ultimately settles near have high G , by “trading off exploration and exploitation”. One can combine such 
term-1-based techniques with the techniques presented here. The resultant hybrid algorithm, addressing all three
terms, outperforms simulated annealing by over two orders of magnitude [28]. 


theory language, the Nash equilibria of a factored system are local maxima of G. In addition to 
this desirable equilibrium behavior, factored systems also automatically provide appropriate off- 
equilibrium incentives to the agents (an issue rarely considered in the game theory / mechanism 
design literature). 

As a trivial example, any “team game” in which all the payoff functions equal $G$ is factored [3,
14]. However, team games often have very poor forms for term 3 in Equation 2, forms which get
progressively worse as the size of the system grows. This is because for large systems where
$G$ sensitively depends on all components of the system, each agent may experience difficulty
discerning the effects of its actions on $G$. As a consequence, each $\eta$ may have difficulty achieving
high $g_\eta$ in a team game. We can quantify this signal/noise effect by comparing the ramifications
on $g_\eta(\zeta)$ arising from changes to $\zeta_\eta$ with the ramifications arising from changes to $\zeta_{\hat{\eta}}$ (i.e.,
changes to all nodes other than $\eta$). We call this quantification the differential learnability of
payoff utility $g_\eta$ in the vicinity of $\zeta$ [31]:


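$$\lambda_{\eta, g_\eta}(\zeta) = \frac{\| \nabla_{\zeta_\eta} g_\eta(\zeta) \|}{\| \nabla_{\zeta_{\hat{\eta}}} g_\eta(\zeta) \|} . \qquad (3)$$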


The denominator in Equation 3 reflects how sensitive $g_\eta(\zeta)$ is to changes in $\zeta_{\hat{\eta}}$, i.e., changes to
agents other than $\eta$. In contrast, the numerator reflects how sensitive $g_\eta(\zeta)$ is to changing $\zeta_\eta$.
So at a given state $\zeta$, the higher the learnability, the more $g_\eta(\zeta)$ depends on the move of agent
$\eta$, i.e., the better the associated signal-to-noise ratio for $\eta$. Intuitively then, higher learnability
means it is easier for $\eta$ to achieve a large value of its intelligence.
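The scaling problem of team games can be sketched numerically. The Python snippet below estimates the learnability ratio described above with central finite differences, under the simplifying assumptions that each agent's move is a single real number and that the payoff depends on at least one other agent (so the denominator is nonzero). The names `partial`, `learnability` and `team_game`, and the linear world utility, are hypothetical choices for illustration only.

```python
import math


def partial(f, z, i, h=1e-6):
    """Central finite-difference estimate of df/dz[i] at the point z."""
    z_plus, z_minus = list(z), list(z)
    z_plus[i] += h
    z_minus[i] -= h
    return (f(z_plus) - f(z_minus)) / (2 * h)


def learnability(payoff, z, agent):
    """Sensitivity of the payoff to the agent's own move divided by its
    sensitivity to the moves of all other agents (scalar moves assumed)."""
    own = abs(partial(payoff, z, agent))
    others = math.sqrt(sum(partial(payoff, z, i) ** 2
                           for i in range(len(z)) if i != agent))
    return own / others


def team_game(z):
    # Hypothetical world utility: every agent's move contributes equally.
    return sum(z)


for n_agents in (5, 101):
    print(n_agents, learnability(team_game, [0.0] * n_agents, agent=0))
```

For this toy world utility the ratio is 1/sqrt(n - 1), so it drops from about 0.5 with five agents to about 0.1 with 101 agents, which is the kind of degradation the text attributes to team games in large systems.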


2.3 Difference Utilities 

It is possible to solve for the set of all payoff utilities that are factored with respect to a particular
world utility. Unfortunately, in general it is not possible for a system both to be factored and to
have infinite learnability (i.e., no dependence of any $g_\eta$ on any agent other than $\eta$) for all of its
agents [27]. However, consider difference utilities, which are of the form:

$$DU(\zeta) = G(\zeta) - \Gamma(f(\zeta)) , \qquad (4)$$

where $\Gamma(f)$ is independent of $\zeta_\eta$. Such difference utilities are factored [31]. In addition, under
usually benign approximations, the differential learnability can be maximized over the set of
difference utilities by choosing $f = G$ and setting $\Gamma$ to the expected value operator [31]. We call
the resultant difference utility the Aristocrat Utility (AU), loosely reflecting the fact that it
measures the difference between an agent’s actual action and the average action:

$$AU_\eta(\zeta) = G(\zeta) - E(G \mid \zeta_{\hat{\eta}}, s) . \qquad (5)$$

Having each agent $\eta$ use its AU as its payoff function theoretically ensures good form for both
terms 2 and 3 in Equation 2. Achieving this, however, is not always feasible. The problem is
that to evaluate the expectation value defining its AU, each agent needs to evaluate the current 
probabilities of each of its potential moves. However if the agent then changes its payoff function 
to be the associated AU, it will in general substantially change its ensuing behavior. (The agent 
now wants to choose moves that maximize a different function from the one it was maximizing 
Figure 2: This example shows the impact of clamping on the joint state of a four-agent system
where each agent has three possible actions, and each such action is represented by
a three-dimensional unary vector. The first matrix represents the joint state of the
system $\zeta$ where agent 1 has selected action 1, agent 2 (shown in bold for emphasis) has
selected action 3, agent 3 has selected action 1 and agent 4 has selected action 2. The
second matrix displays the effect of clamping agent 2’s action to the “null” vector (i.e.,
replacing $\zeta_{\eta_2}$ with $\vec{0}$). The third matrix shows the effect of instead clamping agent 2’s
move to the “average” action vector $\vec{a} = \{.33, .33, .33\}$, which amounts to replacing
that agent’s move with the “illegal” move of fractionally taking each possible move
($\zeta_{\eta_2} = \vec{a}$).


before.) In other words, it will change the probabilities of its moves, which means that its new 
payoff function is in fact not the AU for its actual (new) probabilities. 
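The self-consistency issue can be seen in a small discrete example. The sketch below evaluates an AU of the form of Equation 5 for one agent, given an explicit probability for each of that agent's moves, and then re-evaluates it after those probabilities change. The world utility `G` (the number of distinct moves chosen), the joint move, and the function name `aristocrat_utility` are all hypothetical; the joint move simply reuses the four-agent, three-action layout of Figure 2 with zero-based labels.

```python
def aristocrat_utility(G, joint, agent, moves, probs):
    """AU for `agent`: G at the actual joint move minus the expectation of G
    over the agent's own moves, with the other agents' moves held fixed."""
    expected = 0.0
    for move, p in zip(moves, probs):
        alternative = list(joint)
        alternative[agent] = move
        expected += p * G(alternative)
    return G(joint) - expected


def G(joint):
    # Hypothetical world utility: number of distinct moves chosen.
    return len(set(joint))


joint = [0, 2, 0, 1]   # four agents, three possible moves each
moves = [0, 1, 2]

# AU of the second agent (index 1) under two different move distributions.
print(aristocrat_utility(G, joint, 1, moves, [1 / 3, 1 / 3, 1 / 3]))
print(aristocrat_utility(G, joint, 1, moves, [0.0, 0.0, 1.0]))
```

The same joint move yields an AU of about 2/3 under uniform probabilities and 0 once the agent always picks its current move, so the payoff being maximized shifts as the agent's behavior shifts.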

There are ways around this self-consistency problem (including setting the probabilities of
taking the possible actions to fixed values as discussed in Section 3), but in practice it is often
easier to bypass the entire issue, by giving each $\eta$ a payoff function that does not depend on
the probabilities of $\eta$’s own moves. Before defining a family of such payoff utility functions, let
us define the clamping parameter $CL_\eta^{\vec{v}}$ as the fictitious state where agent $\eta$’s state has been
clamped to the pre-fixed vector $\vec{v}$. The state to which $\eta$ is clamped can be chosen from $\eta$’s
possible states, or it can be an arbitrary vector. In the latter case, care must be taken to ensure
that $G$ is defined over (or extended to) the “illegal” clamped states. Figure 2 shows an example
of clamping when the state of each of the four agents is a unary action vector for each time step.

Now, let us define a family of payoff functions, called Wonderful Life (WL) utility functions.
For each agent $\eta$, the WL utility parameterized by the vector $\vec{v}$ to which agent $\eta$ is clamped is
given by:

$$WLU_\eta^{\vec{v}} = G(\zeta) - G(\zeta_{\hat{\eta}}, CL_\eta^{\vec{v}}) . \qquad (6)$$

Note that, regardless of the choice of the clamping parameter, WL utilities are of the form
given in Equation 4, and therefore are factored. Furthermore, it can be proven that in many
circumstances, especially in large problems, $\lambda_{\eta, WLU}(\zeta) > \lambda_{\eta, G}(\zeta)$, i.e., WLU has higher differential
learnability than does the team game choice of payoff utilities [31]. This is mainly due
to the second term of WLU, which removes a lot of the effect of other agents (i.e., noise) from
$\eta$’s utility. The result is that though WLU does not theoretically match the high learnability of
AU, in practice convergence to optimal $G$ with WLU is much quicker (up to orders of magnitude
so [31, 32]) than with a team game.
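A small sketch may help show why the choice of clamping vector does not affect factoredness. The Python code below evaluates a WL utility of the form of Equation 6 for one agent before and after that agent alone changes its move, once with a "null" clamp and once with a clamp to a legal move. The world utility `G`, the use of `None` to stand for the null move, and the function name `wonderful_life_utility` are assumptions made for this example.

```python
def wonderful_life_utility(G, joint, agent, clamp):
    """WLU for `agent`: G at the actual joint move minus G with the agent's
    move replaced by the fixed clamping value `clamp`."""
    clamped = list(joint)
    clamped[agent] = clamp
    return G(joint) - G(clamped)


def G(joint):
    # Hypothetical world utility: number of distinct moves chosen.
    # None stands for the "null" move, i.e. the agent is removed.
    return len({move for move in joint if move is not None})


before = [0, 2, 0, 1]
after = [0, 0, 0, 1]   # only the agent at index 1 changes its move (2 -> 0)

changes = {}
for clamp in (None, 1):
    d_wlu = (wonderful_life_utility(G, after, 1, clamp)
             - wonderful_life_utility(G, before, 1, clamp))
    d_G = G(after) - G(before)
    changes[clamp] = (d_wlu, d_G)
    print(clamp, d_wlu, d_G)
```

Because the clamped term does not depend on the agent's own move, the change in WLU equals the change in G for either clamp; only the learnability, not the alignment, depends on the clamp.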

Though all WL payoff utilities are factored regardless of the choice of $\vec{v}$, the selection of
$\vec{v}$ affects the learnability of the payoff function. Therefore, in practice matching the proper
clamping parameter to the domain greatly improves the performance of the system [31]. In
many circumstances there is a particular choice of clamping parameter for agent $\eta$ that is a
“null” move (i.e., $\vec{v} = \vec{0}$) for that agent, equivalent to removing that agent from the system,
hence the name of this payoff function. For such a clamping parameter WLU is closely related
to the economics technique of “endogenizing a player’s (agent’s) externalities” [15]. Indeed,
WLU has conceptual similarities to Vickrey tolls [23] in economics, and Groves’ mechanism [6] in
mechanism design⁵. However, because WLU can be applied to arbitrary, time-extended utility
functions, and need not be restricted to the “null” clamping operator interpretable in terms of
“externality payments”, it can be viewed as a generalization of these concepts.

For example, it is usually the case that using WLU with a clamping parameter that is as close
as possible to the expected action (denoted by $\vec{a}$), and not the “null” action, results in higher
learnability than does clamping to the null move. For example, in Figure 2, if the probability of
agent 2 making each of its possible moves is 1/3, then one can expect that setting the clamping
parameter to $\vec{a}$ would result in maximizing learnability. Accordingly, in practice, use of such an
alternative WLU derived as a “mean-field” approximation⁶ to AU almost always results in better
values of $G$ than does the “endogenizing” WLU [32].

Furthermore, though a conceptual similarity between AU and the AGV-Arrow mechanism [4] 
appears to exist, there is a fundamental difference: In the AGV-Arrow mechanism, one averages 
the actions of “other” players conditioned on one’s own action, i.e., one gets a sense of the 
“average background”. This is fundamentally different from the operation performed by AU,
which is to average one’s own actions, based on others’ actions⁷, and subtract the resulting world
utility from the actual world utility. This provides a difference utility that measures the “impact” 
of taking a particular action, given everyone else’s state. 

Intuitively, one can look at AU and WLU from the perspective of a human company, with $G$
the “bottom line” of the company, the agents $\eta$ identified with the employees of that company,
and the associated $g_\eta$ given by the employees’ performance-based compensation packages. For
example, for a “factored company”, each employee’s compensation package contains incentives
designed such that the better the bottom line of the corporation, the greater the employee’s
compensation. As an example, the CEO of a company wishing to have the payoff utilities of the 
employees be factored with G may give stock options to the employees. The net effect of this 

5 Note also that Groves’ mechanism is restricted to public resources where an agent’s use of that resource
does not affect the ability of other agents to use the resource (e.g., a lighthouse) and therefore the tragedy of
the commons does not arise. Groves’ mechanism was specifically formulated to solve the problem of agents being
untruthful in reporting their utility for public goods, a problem not present in the collectives framework.

6 Formally, our approximation is exact only if the expected value of G equals G evaluated at the expected 
joint move (both expectations being conditioned on given moves by all agents other than $\eta$). In general though,
for relatively smooth G, we expect such a mean-field approximation to AU, to give good results, even if the 
approximation does not hold exactly [32]. 

7 In this analogy, we equate state to action, e.g., an agent’s action uniquely determines its state. 



action is to ensure that what is good for the employee is also good for the company. In addition, 
if the compensation packages have “high learnability” , the employees will have a relatively easy 
time discerning the relationship between their behavior and their compensation. In such a case 
the employees will both have the incentive to help the company and be able to determine how 
best to do so. Note that in practice, providing stock options is usually more effective in small 
companies than in large ones. This makes perfect sense in terms of the collectives formalism, 
since such options generally have higher learnability in small companies than they do in large 
companies, in which each employee has a hard time seeing how his/her moves affect the company’s 
stock price. 


Figure 3: Agents collecting tokens of varying value. 



3 Multi-agent Grid World Problem 

3.1 Problem Description 

A common reinforcement learning problem is the Grid World Problem [20], where an agent 
navigates about a two-dimensional n x n grid. At each time step, the agent can move up, down, 
right or left one grid square, and receives a reward after each move. The observable state space 
for the agent is its grid coordinate and the reward it receives depends on the grid square to 
which it moves. In the episodic version, which is the focus of this paper, the agent moves for a 
fixed number of time steps, and then is returned to its starting location. This problem typically 
requires the use of a reinforcement learner that can optimize a sum of rewards in contrast to one 
that optimizes an immediate reward, since the agent may have to cross squares of low reward 
value to enter the squares of high value. Q-learners or the Sarsa algorithm [20] are often used 
for this problem.

In this paper we apply the theory of collectives to a multi-agent version of the grid world 
problem. In this instance of the problem there are multiple agents navigating the grid simul- 
taneously interacting with each others’ rewards. This reward interaction is modeled through 
the use of tokens that are distributed throughout the grid squares of the gridworld (Figure 3).
Each token has a value between zero and one, and each grid square can have at most one token. 
When an agent moves into a grid square it receives a reward for the value of the token and then 
removes the token so that a reward will no longer be received when another agent enters the grid 
square. However, all the tokens are reset at the end of an episode. The global objective of the 
Multi-agent Grid World Problem is to collect the highest aggregated value of tokens in a fixed 
number of time steps. 

The Multi-agent Grid World Problem is an idealized version of many real world problems, 
including the control of multiple planetary exploration vehicles (e.g., rovers on the surface of
Mars collecting rocks in an attempt to maximize total scientific return, or submersibles under Europa
examining potential life signs). Furthermore, the agent interaction provides a critical study of
coordination and interference, as the agents have the potential to work at cross-purposes. This
problem is particularly interesting in a multi-agent setting because each agent attempting to
maximize the value of the tokens it collects can drive the world utility to severely sub-optimal
values. As such, the design of the payoff functions is crucial in this problem, and we address this
issue below. 

3.2 Collective-Based Solution 

To pose the Multi-agent Grid World Problem in the form of the collective framework we need to 
define: 

• $L_{\eta,t}$: The matrix representing the location of an agent. If agent $\eta$ at time $t$ is in location
$(x, y)$, then $L_{\eta,t,x,y} = 1$; otherwise $L_{\eta,t,x,y} = 0$. Furthermore, $\{L_{\eta,t}\}$ denotes the set of all the
agents’ location matrices.

• $L^a_{\eta,t}$: The location matrix agent $\eta$ would have had at time $t$, had it taken action $a$ at time
step $t - 1$.

• $L_\eta$: The location matrix of agent $\eta$ across all time ($L_\eta = \sum_t L_{\eta,t}$).

• $L_{\eta,<t}$: The location matrix of agent $\eta$ across times less than $t$ ($L_{\eta,<t} = \sum_{t'<t} L_{\eta,t'}$).

• $L_t$: The location matrix of all agents at time $t$ ($L_t = \sum_\eta L_{\eta,t}$).

• $L$: The location matrix of all agents across all time ($L = \sum_t L_t = \sum_t \sum_\eta L_{\eta,t}$).

• $L_{<t}$: The location matrix of all agents across times less than $t$ ($L_{<t} = \sum_{t'<t} L_{t'} = \sum_{t'<t} \sum_\eta L_{\eta,t'}$).

• $L_{\hat{\eta}}$: The location matrix of all agents other than $\eta$ across all time ($L_{\hat{\eta}} = L - L_\eta$).

• $L_{\hat{\eta},<t}$: The location matrix of all agents other than $\eta$ across times less than $t$ ($L_{\hat{\eta},<t} = L_{<t} - L_{\eta,<t}$).

• $\Theta$: The initial token value matrix (e.g., $\Theta_{x,y}$ contains the initial value of the token at
location $(x, y)$).

The space $Z$ is composed of $\Theta$ and the set of all possible location matrices, $\{L_{\eta,t}\}$, given the
length of an episode. A worldline $\zeta$ is a point in this space, i.e., the combination of the token
configuration $\Theta$, along with a particular set $\{L_{\eta,t}\}$. We now define the function $V(L, \Theta)$ which
returns the value of the tokens received from a location matrix. Formally:

$$V(L, \Theta) = \sum_{x,y} \Theta_{x,y} \min(1, L_{x,y}) . \qquad (7)$$

The global utility $G(\zeta)$ is the sum of the values of all the tokens collected during an episode:

$$G(\zeta) = V(L, \Theta) . \qquad (8)$$
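To make Equations 7 and 8 concrete, here is a minimal Python sketch that builds an aggregate location matrix from agent paths and evaluates the token value function on it. The helper names `token_value` and `location_matrix`, the 3 x 3 token layout, and the two paths are illustrative assumptions; the sketch also assumes that a path lists the squares an agent moves into, one per time step.

```python
def token_value(L, theta):
    """V(L, Theta): total value of tokens on squares visited at least once.

    L[x][y] counts visits to square (x, y); theta[x][y] is the initial token
    value there. min(1, count) ensures each token is only counted once.
    """
    return sum(theta[x][y] * min(1, L[x][y])
               for x in range(len(theta))
               for y in range(len(theta[0])))


def location_matrix(paths, n):
    """Sum of the agents' location matrices over all time steps.
    `paths` maps each agent name to the list of (x, y) squares it entered."""
    L = [[0] * n for _ in range(n)]
    for path in paths.values():
        for x, y in path:
            L[x][y] += 1
    return L


# Hypothetical 3 x 3 token layout and two short agent paths.
theta = [[0.0, 0.5, 0.0],
         [0.2, 0.0, 0.9],
         [0.0, 0.3, 0.0]]
paths = {"A": [(0, 1), (1, 1), (1, 2)],
         "B": [(0, 1), (0, 0), (1, 0)]}

world_utility = token_value(location_matrix(paths, 3), theta)
print(round(world_utility, 6))
```

Both agents enter square (0, 1), but the 0.5 token there is counted once, so the world utility is 0.5 + 0.9 + 0.2.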

Based on the definitions and world utility given above, let us now derive the collective-based
utility functions for this domain. In this formulation, the AU (given in Equation 5) becomes:

$$AU_\eta(\zeta) = G(\zeta) - \sum_{a \in A_\eta} p_a \, V(L_{\hat{\eta}} + L^a_\eta, \Theta) , \qquad (9)$$

where $A_\eta$ is the set of possible action sequences agent $\eta$ can take and $p_a$ is the probability of $\eta$
taking sequence $a$. The second term in the equation
is the expected value of the global utility over all the possible actions of agent $\eta$.

Now, let us formulate the WL utilities for this domain. First, setting the clamping parameter $CL_{\eta}$ to the null vector, we obtain the WL utility where the agent is removed from the worldline:

$$WLU^{0}_{\eta}(\zeta) = G(\zeta) - V(L_{-\eta}, \Theta). \qquad (10)$$

This utility returns an agent’s contribution to the world utility. Note that this utility differs from one where the values of the tokens present in the locations visited by the agent are summed (i.e., a selfish utility). $WLU^{0}$ gives the value of the tokens in locations not visited by other agents, i.e., the values of the tokens that would not have been picked up had agent $\eta$ not been in the system.
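The distinction between $WLU^{0}$ and a sum over the cells an agent visited can be made concrete with a small sketch. The helper names and the 1x3 world below are hypothetical; the code simply applies Equation 10 to per-agent location matrices stacked along the first axis.

```python
import numpy as np

def token_value(L, theta):
    """V(L, Theta) from Equation 7."""
    return float(np.sum(theta * np.minimum(1.0, L)))

def wlu_zero(L_agents, eta, theta):
    """WLU^0 for agent eta: G minus the value collected with eta removed."""
    L_all = L_agents.sum(axis=0)
    L_others = L_all - L_agents[eta]
    return token_value(L_all, theta) - token_value(L_others, theta)

theta = np.array([[4.0, 6.0, 3.0]])   # hypothetical 1x3 world
L_agents = np.array([
    [[1, 1, 0]],   # agent 0 visits cells 0 and 1
    [[0, 1, 1]],   # agent 1 visits cells 1 and 2
])

contribution = wlu_zero(L_agents, 0, theta)        # only cell 0 is unique to agent 0
visited_sum = token_value(L_agents[0], theta)      # value of every cell agent 0 visited
print(contribution, visited_sum)
```

Agent 0 visits cells worth 4 and 6, but the cell worth 6 is also visited by agent 1, so only the 4 counts toward its $WLU^{0}$.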

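Next, let us define the WL utility resulting from agent $\eta$ taking the fictitious average action, where it partially takes all possible actions:

$$WLU^{a}_{\eta}(\zeta) = G(\zeta) - V\Big(L_{-\eta} + \sum_{\vec{a} \in A_{\eta}} p_{\vec{a}} \, L^{\vec{a}}_{\eta}, \; \Theta\Big) \qquad (11)$$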
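The difference between AU (Equation 9) and $WLU^{a}$ (Equation 11) is where the average over action sequences is taken: outside $V$ for AU, inside $V$ for $WLU^{a}$. The sketch below is a minimal illustration under assumed inputs; the function names, the two candidate action sequences and their probabilities are all hypothetical.

```python
import numpy as np

def token_value(L, theta):
    """V(L, Theta) from Equation 7."""
    return float(np.sum(theta * np.minimum(1.0, L)))

def au(L_others, L_actual, candidates, probs, theta):
    """Equation 9: G minus the expected value of V over candidate sequences."""
    G = token_value(L_others + L_actual, theta)
    expected = sum(p * token_value(L_others + L_a, theta)
                   for p, L_a in zip(probs, candidates))
    return G - expected

def wlu_avg(L_others, L_actual, candidates, probs, theta):
    """Equation 11: G minus V of the probability-weighted ("smeared") locations."""
    G = token_value(L_others + L_actual, theta)
    smeared = sum(p * L_a for p, L_a in zip(probs, candidates))
    return G - token_value(L_others + smeared, theta)

theta = np.array([[8.0, 6.0, 0.0]])       # hypothetical 1x3 world
L_others = np.zeros((1, 3))               # no other agents in this toy case
stay_left = np.array([[2.0, 0.0, 0.0]])   # two time steps spent on cell 0
stay_mid = np.array([[0.0, 2.0, 0.0]])    # two time steps spent on cell 1
candidates, probs = [stay_left, stay_mid], [0.5, 0.5]

au_value = au(L_others, stay_left, candidates, probs, theta)
wlu_value = wlu_avg(L_others, stay_left, candidates, probs, theta)
print(au_value, wlu_value)
```

In this contrived case the two utilities differ because $\min(1, \cdot)$ is nonlinear: the smeared agent occupies each cell with weight 1, so both tokens count fully inside $V$.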

Because these utilities are based on the performance on a full episode, they are problematic to work with directly. We therefore introduce single time step “rewards” that will help in learning the set of actions (e.g., through Q-learners or Sarsa learners) that will lead to good values for the utility. Note that the utilities will be undiscounted sums of these rewards. To that end, first let us decompose an arbitrary utility $U$ in the following manner:

$$U(L) = \sum_{t} \big( U(L_{<t+1}) - U(L_{<t}) \big). \qquad (12)$$

Now, let us define the single time step reward $R_t$ by:

$$R_{t}(L) = U(L_{<t+1}) - U(L_{<t}) \qquad (13)$$
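The decomposition in Equations 12 and 13 is a telescoping sum, which the following sketch checks numerically for the global utility. The function names and the three-step toy episode are hypothetical; the empty history is assumed to have zero utility.

```python
import numpy as np

def token_value(L, theta):
    """V(L, Theta) from Equation 7."""
    return float(np.sum(theta * np.minimum(1.0, L)))

def step_rewards(L_by_step, theta):
    """R_t = U(L_{<t+1}) - U(L_{<t}) with U = V, for each time step in order."""
    rewards = []
    L_before = np.zeros_like(theta)
    for L_t in L_by_step:
        L_after = L_before + L_t
        rewards.append(token_value(L_after, theta) - token_value(L_before, theta))
        L_before = L_after
    return rewards

theta = np.array([[3.0, 7.0, 2.0]])       # hypothetical 1x3 world
L_by_step = [
    np.array([[1.0, 0.0, 0.0]]),          # t = 1: cell 0 entered
    np.array([[0.0, 1.0, 0.0]]),          # t = 2: cell 1 entered
    np.array([[0.0, 1.0, 0.0]]),          # t = 3: cell 1 revisited, nothing new
]

rewards = step_rewards(L_by_step, theta)
episode_utility = token_value(sum(L_by_step), theta)
print(rewards, episode_utility)
```

The undiscounted sum of the rewards equals the utility of the full episode, which is what allows a learner that maximizes summed rewards to maximize the utility.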

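Now we can generate the four single time step reward versions of the four utilities$^{8}$:

$$GR_{t}(\zeta) = V(L_{<t+1}, \Theta) - V(L_{<t}, \Theta) \qquad (14)$$

$$AR_{\eta,t}(\zeta) = GR_{t}(\zeta) - \sum_{\vec{a} \in A_{\eta}} p_{\vec{a}} \Big( V(L_{-\eta,<t+1} + L^{\vec{a}}_{\eta,<t+1}, \Theta) - V(L_{-\eta,<t} + L^{\vec{a}}_{\eta,<t}, \Theta) \Big) \qquad (15)$$

$$WLR^{0}_{\eta,t}(\zeta) = GR_{t}(\zeta) - \Big( V(L_{-\eta,<t+1}, \Theta) - V(L_{-\eta,<t}, \Theta) \Big) \qquad (16)$$

$$WLR^{a}_{\eta,t}(\zeta) = GR_{t}(\zeta) - \Big( V\big(L_{-\eta,<t+1} + \sum_{\vec{a} \in A_{\eta}} p_{\vec{a}} L^{\vec{a}}_{\eta,<t+1}, \Theta\big) - V\big(L_{-\eta,<t} + \sum_{\vec{a} \in A_{\eta}} p_{\vec{a}} L^{\vec{a}}_{\eta,<t}, \Theta\big) \Big) \qquad (17)$$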


Note that, as expressed above, the formulations for AR and $WLR^{a}$ have significant drawbacks. First, the set of all possible action sequences is very large, and grows exponentially with $t$. Second, without prior information, the average path contains little information and, for WLR, is similar to clamping to zero. To side-step these issues, we make the approximation that each action is equally likely, and compute the average action over the actions available to the agent in the previous time step:


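$$AR_{\eta,t}(\zeta) = GR_{t}(\zeta) - \sum_{a \in A_{\eta,t-1}} p_{a} \Big( V(L_{-\eta,<t+1} + L^{a}_{\eta,<t+1}, \Theta) - V(L_{-\eta,<t} + L^{a}_{\eta,<t}, \Theta) \Big)$$

$$WLR^{a}_{\eta,t}(\zeta) = GR_{t}(\zeta) - \Big( V\big(L_{-\eta,<t+1} + \sum_{a \in A_{\eta,t-1}} p_{a} L^{a}_{\eta,<t+1}, \Theta\big) - V\big(L_{-\eta,<t} + \sum_{a \in A_{\eta,t-1}} p_{a} L^{a}_{\eta,<t}, \Theta\big) \Big)$$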


Note that the average action sequence has been replaced with the sequence of average actions, that is, the sequence where at each time step the average action has been taken. Because of the arbitrariness of the clamping operator (see discussion in Section 2) and the fact that, theoretically, clamping to any fixed vector results in a factored system, this approximation is milder for $WL^{a}$ than it is for AU.
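One possible reading of the "sequence of average actions" is sketched below: at each time step the agent's location is replaced by a uniform average over the cells reachable from its previous position. This is only an interpretation under stated assumptions. The four-move action set, the boundary handling (moves off the grid leave the agent in place) and all function names are hypothetical and not taken from the paper.

```python
import numpy as np

MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # assumed action set: up, down, left, right

def token_value(L, theta):
    """V(L, Theta) from Equation 7."""
    return float(np.sum(theta * np.minimum(1.0, L)))

def smeared_step(pos, shape):
    """Uniform average of the one-step location matrices reachable from pos."""
    S = np.zeros(shape)
    for dx, dy in MOVES:
        x = min(max(pos[0] + dx, 0), shape[0] - 1)
        y = min(max(pos[1] + dy, 0), shape[1] - 1)
        S[x, y] += 1.0 / len(MOVES)
    return S

def wlr_avg(theta, L_before, L_after, own_before, own_after, S_before, S_after):
    """Single-step WL reward with the agent's history clamped to smeared steps."""
    gr = token_value(L_after, theta) - token_value(L_before, theta)
    clamped_after = L_after - own_after + S_after
    clamped_before = L_before - own_before + S_before
    return gr - (token_value(clamped_after, theta) - token_value(clamped_before, theta))

theta = np.zeros((3, 3))
theta[1, 2] = 8.0                        # single token to the right of the agent
empty = np.zeros((3, 3))
own_after = np.zeros((3, 3))
own_after[1, 2] = 1.0                    # the agent moved right from (1, 1)
S_after = smeared_step((1, 1), theta.shape)

reward = wlr_avg(theta, empty, own_after, empty, own_after, empty, S_after)
print(reward)
```

With one agent and an empty history, the actual move collects the full token (8), while the smeared move collects a quarter of it (2), so the reward credits the agent with the difference.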


4 Experimental Results 

To evaluate the effectiveness of the collective-based approach in the Multi-agent Grid World, we conducted experiments on three different types of token distributions. The payoff utility functions investigated included: (i) the Selfish Utility (SU), where each agent receives the weighted total of the tokens that it alone collected. It is the natural extension of the single agent problem, and represents the optimal utility in the single rover domain; (ii) the Team Game (TG) utility, where each agent receives the full world utility; (iii) the $WL^{0}$ utility, where the clamping parameter is set to 0. Intuitively, this utility computes the contribution an agent makes to the token collection, by looking at the difference in the total token collection with and without that agent; (iv) the $WL^{a}$ utility, where the clamping parameter is set to $a$, representing the difference in utility value resulting from an agent’s actual action and its “smeared” action; and (v) the AU, where the agent’s contribution is computed as the difference between the action it took and its expected action.

8 In the actual implementation there are some tie breaking rules if more than one agent goes into the same square at the same time.

Figure 4: Distribution of Token Values in “Corner” World

In the following subsections we evaluate the five payoff utilities using three different distributions of tokens. Since we are unaware of any standard gridworld benchmarks that specify reward distributions, we created artificial distributions of tokens that could illustrate the important collective learning issues. The first set of tokens is the most “unnatural,” but requires the agents to optimize a sum of rewards instead of an immediate reward, while working together, if high world utility is to be achieved. The second set of tokens has similar properties to the first set, but has smoother transitions in token values. The final set is randomly generated from Gaussian kernels on every trial, to illustrate that the collective-based principles still hold on less hand-crafted token distributions.

4.1 Corner World Token Value Distribution 

The first experimental domain we investigated consisted of a world where the “highly valued” tokens are concentrated in one corner, with a second concentration near the center where the rovers are initially located. The rest of the world is uniformly filled with tokens of little importance. Figure 4 conceptualizes this distribution for a 20x20 world.

Figure 5: Effect of Payoff Utility on System Performance. (a) 40 Rovers on a 20x20 grid. (b) 100 Rovers on a 32x32 grid. (Horizontal axis: Number of Training Episodes.)


Figure 5(a) shows the performance of the different payoff utilities for 40 agents on a 400 unit-square world for the token value distribution shown in Figure 4, where an episode consists of 20 time steps (error bars of ± one $\sigma$ are included, though they are smaller than the symbols). The performance measure in these figures is the “normalized” world utility given by $G(\zeta) / V(\mathbf{1}_n, \Theta)$, where $\mathbf{1}_n$ is the $n \times n$ matrix of ones. This normalized utility provides the fraction of token values that was collected by the agents (a value of 1 means all available tokens were collected).
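A minimal sketch of this performance measure follows, assuming the normalization is the total initial token value, i.e., $V$ evaluated on the matrix of ones. That reading is inferred from the description of the measure as the fraction of token values collected; the function names and the toy world are hypothetical.

```python
import numpy as np

def token_value(L, theta):
    """V(L, Theta) from Equation 7."""
    return float(np.sum(theta * np.minimum(1.0, L)))

def normalized_world_utility(L, theta):
    """Fraction of the total token value collected (assumed normalization)."""
    ones = np.ones_like(theta)             # every cell visited: all tokens collected
    return token_value(L, theta) / token_value(ones, theta)

theta = np.array([[5.0, 0.0],
                  [0.0, 15.0]])            # hypothetical 2x2 world
L = np.array([[1.0, 0.0],
              [0.0, 0.0]])                 # only the token worth 5 is collected

fraction = normalized_world_utility(L, theta)
print(fraction)
```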



Figure 6: Scaling Properties of Different Payoff Functions. (Horizontal axis: Number of Rovers, from 10 to 100.)

The results show that SU produced poor results, which were indeed worse than those of random actions. This is caused by all agents aiming to acquire the most valuable tokens, and congregating towards the corner of the world where such tokens are located. In essence, in this case agents using the SU payoff competed, rather than cooperated, with one another. The agents using TG fared marginally better, but their learning was slow. This system was plagued by the signal-to-noise problem associated with each agent receiving the full world reward for each individual action it took. Notice that both the selfish agents and those trained with TG had a drop in their performance in the early going, as they learned the “wrong” actions (as far as the world utility is concerned). Agents using TG payoffs overcame this early setback whereas selfish agents never did. In contrast, agents using $WL^{0}$ and AU performed very well, and agents using $WL^{a}$ performed almost optimally. In each of these three cases, the reinforcement signal the agents received was both factored and showed how their actions affected the world reward more clearly than did the TG reinforcement signal.

For a scaled-up version of the same token value distribution, Figure 5(b) shows the results for 100 agents on a 1024 unit-square grid, where an episode consists of 32 time steps. Qualitatively, the results are similar to the 40 agent case. However, note that the team game agents have a harder time learning, because in this case the reinforcement signal is even further diluted. Furthermore, the performance of $WL^{a}$ is now clearly superior to that of $WL^{0}$, showing that using the degree of freedom given by the clamping parameter provides significant improvements over solutions aimed at “endogenizing externalities” or similar to the Groves mechanism ($WL^{0}$).

Figure 6 explores the scaling issue in more detail. As the number of agents was increased, we 
kept the difficulty of the problem the same by increasing the size of the grid world, and allocating 
more time for an episode. Specifically the ratio of the number of agents to total number of grid 
squares and the ratio of the number of agents to total value of tokens was held constant. In 
addition the ratio of the furthest grid square from the agents’ starting point to the total amount 
of time in an episode was also held constant. The scaling results show that agents using TG payoffs were not hampered as much by the noise associated with other agents when the number of agents was low. As the system scaled up, however, their performance deteriorated rapidly. Agents using WL and AU payoffs, on the other hand, were not strongly affected by the increase in the size of the problem. This underscores the need for a payoff utility function that has good signal-to-noise properties so that, in large systems, the agents have an opportunity to learn the actions that will optimize their payoff utilities.

4.2 Incline World Token Value Distribution 

In the second experimental domain we investigated a world where the “highly valued” tokens are still concentrated in one corner, but where there is a “ridge” of moderately high values along a side, starting from the opposite corner. The actual distribution for the token values on a two dimensional (x,y) map is given by:

(18)

where s is the one dimensional “size” of the map (i.e., s = 20 for 40 agents, and s = 32 for 100 agents). Figure 7 conceptualizes this distribution for a 10x10 world (s = 10).

Figure 8 shows the performance of the different payoff utilities for 40 agents and 100 agents on 400 and 1024 unit-square worlds, respectively, on the token value distribution given by Equation 18.

Figure 7: Distribution of Token Values in “Incline” World

Figure 8: Effect of Payoff Utility on System Performance for Incline World Token Value Distribution. (a) 40 Rovers on a 20x20 grid. (b) 100 Rovers on a 32x32 grid. (Vertical axis: World Utility.)


The results show that the WL payoff is unaffected by the change in the token value distribution. Both TG and SU payoffs, in contrast, perform better in this case, showing a much larger sensitivity to the token distribution. The improvements in SU are easily explained: the area surrounding the high token values contains sufficiently many tokens that even when the SU agents are all trying to reach the high valued tokens, they help the world utility.

Though agents using TG collect a larger portion of the tokens as compared to the previous token configuration, the lack of improvement with training displayed in the system where agents use TG payoffs is noteworthy. Because of the noise in the system, these agents do not even learn to “walk” in the right direction in the allotted number of episodes.

4.3 Random World Token Value Distribution 

In the final set of experiments, we investigate the behavior of agents in a gridworld where the 
token values are randomly distributed. In this world, for N agents, there are N/3 Gaussian 
‘attractors’ whose centers are randomly distributed. Figure 9 shows an instance of the gridworld 
using this distribution for the 20x20 world, used in the experiments with 40 agents. 
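A token distribution of this kind could be generated along the following lines. This is only a sketch: the text specifies N/3 randomly centered Gaussian attractors but not their width, their peak value, or how overlapping attractors combine, so the `width` and `peak` parameters, the summation of kernels and the function name are all assumptions.

```python
import numpy as np

def random_world(size, n_agents, rng, width=2.0, peak=10.0):
    """Token values from n_agents // 3 Gaussian attractors at random centers.

    width and peak are hypothetical parameters; overlapping attractors are summed.
    """
    n_attractors = max(1, n_agents // 3)
    centers = rng.uniform(0, size, size=(n_attractors, 2))
    xs, ys = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    theta = np.zeros((size, size))
    for cx, cy in centers:
        sq_dist = (xs - cx) ** 2 + (ys - cy) ** 2
        theta += peak * np.exp(-sq_dist / (2.0 * width ** 2))
    return theta

# 20x20 world for 40 agents, as in the experiments; the seed is arbitrary.
theta = random_world(20, 40, np.random.default_rng(0))
print(theta.shape, float(theta.max()))
```

Drawing a fresh generator state on every trial would give a new distribution each time, matching the description of tokens being regenerated on every trial.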



Figure 9: Distribution of Token Values in “Random” World 

Figure 10 shows that the performance of the agents in the “random” world is very similar to that in the “incline” world, except for the poorer performance of the SU payoff function. This can be explained by the token value distributions: there are many “dry” patches, and agents aiming for the high valued tokens do not necessarily get the consolation of mid-valued tokens. The WL payoff again does well for both clamping parameters.

This experiment illustrates that when agents need to use a “divide and conquer” approach, the selfish utility performs poorly. Furthermore, these experiments illustrate that the collective-based payoff functions are superior across a wide range of token distributions ranging from smooth to irregular to random.

Figure 10: Effect of Payoff Utility on System Performance for Random World Token Value Distribution. (a) 40 Rovers on a 20x20 grid. (b) 100 Rovers on a 32x32 grid.

5 Nash Equilibrium and World Utility Optimum 

Figure 11 illustrates a simple example with two agents and a six square world, where each agent can choose to move left or right for two time steps. There are two tokens, one of value 5 and the other of value 10, that the agents can pick up by entering the appropriate square.

Consider the joint set of moves where agent 1 moves right twice, and agent 2 moves left twice$^{9}$. In this scenario agent 2 picks up a token worth 10 units on the first time step and nothing on the second. Agent 1 does not pick up any tokens. Figure 11 summarizes the reward and utility values associated with this move. Agent 2 receives an SU of 10 (10 for the first step, 0 for the second) whereas agent 1 receives an SU of 0. This results in a world reward of 10 for the first time step and 0 for the second, resulting in a world utility of 10. Incidentally, this is the solution the system settles into if the agents are indeed using SU as their payoff utilities. For SU, this is a Nash equilibrium: there is no unilateral move that a player can make that will improve its utility.

Now, let us analyze the $WL^{0}$ payoff utility for agent 2 for this set of moves.$^{10}$ For the first time step, the WL reward is the same as SU: agent 2 receives 10 for picking up the token. It is in the second time step that the difference emerges: the first term of WL as given in Equation 17 (i.e., the world reward for time step 2) is 0 for this time step, as no tokens are picked up. However, in the worldline where agent 2 has been clamped to zero, the first parameter of the $V$ function, $L_{-\eta}$, does not include the locations of agent 2, causing this function to disregard any tokens agent 2 previously picked up. This causes agent 2 to credit agent 1 for picking up the token of value 10 in the second time step. Agent 2 then computes a value of 10 for the world reward of this state where its action is clamped to 0. The WL reward for agent 2 for this time step is then given by: $WLR^{0}_{2,2} = 0 - 10 = -10$.

9 The second step of agent 2 is arbitrary.

10 Both AU and WL with clamping to $a$ provide similar results, but we omit them for clarity.

Figure 11: WLU Nash Equilibria and World Utility Optima

Now the time-extended WLU for agent 2 can be computed by summing the WLRs. This results in a WLU of 0 (10 for t=1 and -10 for t=2), even though agent 2 picks up a token weighted 10. The interpretation for this “counter-intuitive” utility value is clear: because that token would have been picked up by agent 1 at another time step, the net effect of agent 2’s actions on the world utility was nil, resulting in a WLU value of 0 for agent 2 for this set of actions. Because moving to the right twice provides a WLU value of 5 for agent 2 (as detailed in Figure 11), an agent optimizing its WL payoff utility will take this second action. Similarly, agent 1 moving right twice will receive a WLU of 10. As this simple example shows, each agent maximizing its WLU leads the system to the world utility maximum where both tokens are picked up.

Let us analyze the game-theoretic equilibrium states for WLU and SU in these two solutions. With SU, the system is in a Nash equilibrium for the first set of moves, in that neither agent can improve its SU by unilaterally changing its actions. Therefore, the system is “stuck” in this suboptimal solution. Furthermore, even if the agents stumble upon the second solution by accident, they will not remain there, as this solution is unstable with payoff utilities given by SU: agent 2 can change its move (in future episodes) and improve its payoff utility from 5 to 10. That this move reduces agent 1’s utility from 10 to 0, and the world utility from 15 to 10, has no influence on agent 2’s actions. Furthermore, the first solution is also Pareto-optimal, in that there is no set of joint moves that improves the utility of both agents. This example shows a simple case where Pareto optimality and the optimum of the world utility do not necessarily coincide$^{11}$, and simply seeking a Pareto-optimal solution will not necessarily lead to high values of the world utility function.

On the other hand, agents whose payoff utilities are given by the WLU are in a Nash equilibrium in the second set of actions. They will therefore seek this solution as it offers higher payoff utilities for each agent. The use of WLU has the net effect of “aligning” the Nash equilibrium of the agents with the world utility optimum, ensuring that when the agents optimize their payoff utilities, the world utility is also at a local - and in this case also the global - optimum.
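The equilibrium claims of this section can be checked by brute force on a small instance. The sketch below assumes one particular layout that is consistent with the numbers quoted in the text (the actual layout of Figure 11 may differ): six cells in a row, agent 1 starting in cell 0, agent 2 in cell 3, the token of value 10 in cell 2 and the token of value 5 in cell 5. The tie-breaking rule and all function names are likewise hypothetical.

```python
from itertools import product

N_CELLS = 6
TOKENS = {2: 10, 5: 5}                   # assumed layout: cell index -> token value
START = (0, 3)                           # assumed start cells of agents 1 and 2
PLANS = list(product("LR", repeat=2))    # two moves per agent

def path(start, plan):
    """Cells entered at t=1 and t=2; moves off the grid leave the agent in place."""
    cells, pos = [], start
    for move in plan:
        pos = min(max(pos + (1 if move == "R" else -1), 0), N_CELLS - 1)
        cells.append(pos)
    return cells

def world_utility(paths):
    visited = set().union(*map(set, paths))
    return sum(v for c, v in TOKENS.items() if c in visited)

def selfish(paths):
    """SU: a token goes to the first agent entering its cell (ties: lower index)."""
    su, taken = [0] * len(paths), set()
    for t in range(2):
        for i, p in enumerate(paths):
            if p[t] in TOKENS and p[t] not in taken:
                su[i] += TOKENS[p[t]]
                taken.add(p[t])
    return su

def wlu0(paths):
    """WLU^0: world utility minus world utility with the agent removed."""
    G = world_utility(paths)
    return [G - world_utility(paths[:i] + paths[i + 1:]) for i in range(len(paths))]

def payoffs(plans, payoff):
    return payoff([path(s, p) for s, p in zip(START, plans)])

def is_nash(plans, payoff):
    """True if no agent can raise its own payoff by changing only its own plan."""
    base = payoffs(plans, payoff)
    for i in range(2):
        for alt in PLANS:
            trial = list(plans)
            trial[i] = alt
            if payoffs(trial, payoff)[i] > base[i]:
                return False
    return True

first = (("R", "R"), ("L", "L"))         # agent 1 right twice, agent 2 left twice
second = (("R", "R"), ("R", "R"))        # both agents move right twice

print(payoffs(first, selfish), payoffs(first, wlu0))
print(payoffs(second, selfish), payoffs(second, wlu0))
print(is_nash(first, selfish), is_nash(second, selfish), is_nash(second, wlu0))
```

Under these assumptions the first joint move is a Nash equilibrium for SU but not for WLU, while the second is a Nash equilibrium for WLU and attains the maximum world utility of 15.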


6 Discussion 

In this work we focus on the problem of designing a collective of autonomous agents that in- 
dividually learn sequences of actions such that the resultant sequence of joint actions achieves 
a predetermined global objective. In particular we discuss the problem of controlling multiple 
agents in a gridworld, a problem related to many real world problems including exploration 
vehicles trying to maximize aggregate scientific data collection (e.g., rovers on the surface of 
Mars). 

This paper illustrates how the theory of collectives can be used to leverage the work done on existing reinforcement learners that are able to work with sequences of actions, so that they can be extended to multi-agent problems. In this domain, we addressed the critical issue of what utility functions those agents should strive to maximize. We extended previous results on collective intelligence to agents attempting to maximize sequences of actions, and used Q-learning with rewards set by the theory of collectives. Our results demonstrate that agents using collective-derived goals outperform both “natural” extensions of single agent algorithms and global reinforcement learning solutions based on “team games”.

Even the simplest collective-derived utility, $WL^{0}$, showed marked improvement over greedy and team game utilities as it was able to scale well, while still directing agents to “work together.” To try to increase performance further, this paper presented two utilities in addition to $WL^{0}$: $WL^{a}$ and AU. While $WL^{0}$ performed well in these problems, $WL^{a}$ provided further improvements, and approached the optimal solution in many cases. This improvement was due to $WL^{a}$ returning an agent’s contribution compared to an average action rather than to the more extreme case of comparing it to no action at all. The experimental results confirm the theoretical analysis that shows $WL^{a}$ having a higher learnability than $WL^{0}$.

While $WL^{a}$ proved to be a superior utility, our investigations revealed an interesting situation where AU, the theoretically “best” strategy, was not necessarily the best approach in practice. Although AU is theoretically superior to WLU (higher learnability), two issues prevent us from fully exploiting its power. First, the “expected” action is impossible to compute in a time-extended setting, since even a simple case where an agent has four actions and ten time steps leads to $4^{10}$ possible action sequences. Even Monte Carlo sampling of such a space will yield highly inaccurate estimates of the potential actions and their rewards. Second, estimating the correct probability distributions over the possible actions causes the utility values to change, creating a self-consistency problem. To sidestep both issues, in this article we chose to focus on the last time step (e.g., the current step for the agent) and approximated the AU with the agent taking each of the four actions possible in that time step with equal likelihood. The resulting utility function provided good solutions, but the performance of such a “handicapped” AU did not exceed that of the conceptually simpler $WL^{0}$ and was well below that of $WL^{a}$.

11 Note, in this case the world utility optimum is also Pareto-optimal, but there are no guarantees that the system will stumble upon the “desirable” Pareto-optimal solution.

Finally, a game-theoretic analysis of the utilities provided by the theory of collectives sheds 
light on the reasons for the dramatic improvements obtained over selfish utilities and team games: 
While the Nash equilibrium of the system in which each agent pursues a selfish goal does not 
correspond to a globally good solution, the global optimum is indeed a Nash equilibrium of the 
system in which agents use collective-based utilities. Furthermore, such utilities provide “good” 
off-equilibrium signals to lead the system into desirable equilibrium states, whereas systems in 
which all agents use team game utilities fail to reach the desirable states (e.g., Nash equilibrium 
states which also are optima of the world utility) due to the excessive “noise” in the system. 

Acknowledgements: The authors thank David Wolpert for his contributions and insightful 
comments. 
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